Overview

Dataset statistics

Number of variables: 5
Number of observations: 6040
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 236.1 KiB
Average record size in memory: 40.0 B
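
The two memory figures are related by simple division. The sketch below is a rough check only; it assumes 1 KiB = 1024 bytes and that the average record size is the total size divided by the number of observations, which is an assumption about how the report derives it.

```python
# Rough consistency check of the memory figures in the dataset statistics.
total_kib = 236.1        # "Total size in memory"
n_observations = 6040    # "Number of observations"

total_bytes = total_kib * 1024            # assumes 1 KiB = 1024 bytes
avg_record_bytes = total_bytes / n_observations

print(f"{avg_record_bytes:.1f} B per record")
```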

Variable types

Numeric: 3
Categorical: 2

Alerts

Zip-code has a high cardinality: 3439 distinct values (High cardinality)
Age is highly correlated with Occupation (High correlation)
Occupation is highly correlated with Age (High correlation)
UserID is uniformly distributed (Uniform)
UserID has unique values (Unique)
Occupation has 711 (11.8%) zeros (Zeros)
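
The percentages behind the cardinality and zeros alerts follow directly from counts given in this report. The sketch below only reproduces that arithmetic. The thresholds that trigger each alert depend on the profiling configuration, which is not shown here, so none is assumed.

```python
# Counts taken from this report.
n_rows = 6040
zip_distinct = 3439        # distinct Zip-code values
occupation_zeros = 711     # Occupation cells equal to 0

distinct_pct = 100 * zip_distinct / n_rows   # share of distinct Zip-code values
zeros_pct = 100 * occupation_zeros / n_rows  # share of zero-valued Occupation cells

print(f"Zip-code distinct: {distinct_pct:.1f}%")
print(f"Occupation zeros:  {zeros_pct:.1f}%")
```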

Reproduction

Analysis started: 2022-07-14 02:31:40.296363
Analysis finished: 2022-07-14 02:33:13.463592
Duration: 1 minute and 33.17 seconds
Software version: pandas-profiling v3.2.0
Download configuration: config.json

Variables

UserID
Real number (ℝ≥0)

UNIFORM
UNIQUE

Distinct: 6040
Distinct (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3020.5
Minimum: 1
Maximum: 6040
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 47.3 KiB

Quantile statistics

Minimum: 1
5-th percentile: 302.95
Q1: 1510.75
median: 3020.5
Q3: 4530.25
95-th percentile: 5738.05
Maximum: 6040
Range: 6039
Interquartile range (IQR): 3019.5
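
Because UserID takes every integer from 1 to 6040 exactly once, the quantiles above can be reproduced with linear interpolation between order statistics. The sketch below uses Python's standard `statistics.quantiles` with `method="inclusive"`. That this matches the interpolation used by the profiling tool is an assumption, supported only by the figures agreeing.

```python
import statistics

# UserID is unique and runs from 1 to 6040 (see Distinct, Minimum, Maximum).
user_ids = list(range(1, 6041))

# method="inclusive" treats the data minimum and maximum as the 0th and 100th
# percentiles and interpolates linearly in between.
q1, median, q3 = statistics.quantiles(user_ids, n=4, method="inclusive")
twentieths = statistics.quantiles(user_ids, n=20, method="inclusive")
p5, p95 = twentieths[0], twentieths[-1]
iqr = q3 - q1

print(p5, q1, median, q3, p95, iqr)
```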

Descriptive statistics

Standard deviation: 1743.742145
Coefficient of variation (CV): 0.5773024812
Kurtosis: -1.2
Mean: 3020.5
Median Absolute Deviation (MAD): 1510
Skewness: 0
Sum: 18243820
Variance: 3040636.667
Monotonicity: Strictly increasing
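
Several of these descriptive statistics can be checked with the standard library, again treating UserID as the integers 1 to 6040. The reported variance agrees with the sample variance (n - 1 denominator), so the sketch uses that. The MAD here is the median of absolute deviations from the median, and the monotonicity check is a plain pairwise comparison. These definitions are assumptions, chosen because they reproduce the reported values.

```python
import statistics

user_ids = list(range(1, 6041))

total = sum(user_ids)
mean = statistics.fmean(user_ids)
variance = statistics.variance(user_ids)   # sample variance, n - 1 denominator
std = statistics.stdev(user_ids)           # square root of the sample variance
cv = std / mean                            # coefficient of variation

median = statistics.median(user_ids)
mad = statistics.median(abs(x - median) for x in user_ids)

# Strictly increasing: every value is smaller than the one that follows it.
strictly_increasing = all(a < b for a, b in zip(user_ids, user_ids[1:]))

print(total, mean, round(variance, 3), round(std, 6), round(cv, 10), mad)
```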
Histogram with fixed size bins (bins=50)

Value                Count  Frequency (%)
1                    1      < 0.1%
4024                 1      < 0.1%
4033                 1      < 0.1%
4032                 1      < 0.1%
4031                 1      < 0.1%
4030                 1      < 0.1%
4029                 1      < 0.1%
4028                 1      < 0.1%
4027                 1      < 0.1%
4026                 1      < 0.1%
Other values (6030)  6030   99.8%

Value  Count  Frequency (%)
1      1      < 0.1%
2      1      < 0.1%
3      1      < 0.1%
4      1      < 0.1%
5      1      < 0.1%
6      1      < 0.1%
7      1      < 0.1%
8      1      < 0.1%
9      1      < 0.1%
10     1      < 0.1%

Value  Count  Frequency (%)
6040   1      < 0.1%
6039   1      < 0.1%
6038   1      < 0.1%
6037   1      < 0.1%
6036   1      < 0.1%
6035   1      < 0.1%
6034   1      < 0.1%
6033   1      < 0.1%
6032   1      < 0.1%
6031   1      < 0.1%

Gender
Categorical

Distinct: 2
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 47.3 KiB
M: 4331
F: 1709

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1

Characters and Unicode

Total characters: 6040
Distinct characters: 2
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: F
2nd row: M
3rd row: M
4th row: M
5th row: M

Common Values

Value  Count  Frequency (%)
M      4331   71.7%
F      1709   28.3%

Length

Histogram of lengths of the category

Category Frequency Plot

Value  Count  Frequency (%)
m      4331   71.7%
f      1709   28.3%

Most occurring characters

Value  Count  Frequency (%)
M      4331   71.7%
F      1709   28.3%

Most occurring categories

Value             Count  Frequency (%)
Uppercase Letter  6040   100.0%

Most frequent character per category

Uppercase Letter
Value  Count  Frequency (%)
M      4331   71.7%
F      1709   28.3%

Most occurring scripts

Value  Count  Frequency (%)
Latin  6040   100.0%

Most frequent character per script

Latin
Value  Count  Frequency (%)
M      4331   71.7%
F      1709   28.3%

Most occurring blocks

Value  Count  Frequency (%)
ASCII  6040   100.0%

Most frequent character per block

ASCII
Value  Count  Frequency (%)
M      4331   71.7%
F      1709   28.3%

Age
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 7
Distinct (%): 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 30.63923841
Minimum: 1
Maximum: 56
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 47.3 KiB

Quantile statistics

Minimum: 1
5-th percentile: 18
Q1: 25
median: 25
Q3: 35
95-th percentile: 56
Maximum: 56
Range: 55
Interquartile range (IQR): 10

Descriptive statistics

Standard deviation: 12.89596173
Coefficient of variation (CV): 0.4208969412
Kurtosis: -0.2908100824
Mean: 30.63923841
Median Absolute Deviation (MAD): 7
Skewness: 0.2427000756
Sum: 185061
Variance: 166.3058289
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=7)
Value  Count  Frequency (%)
25     2096   34.7%
35     1193   19.8%
18     1103   18.3%
45     550    9.1%
50     496    8.2%
56     380    6.3%
1      222    3.7%

Value  Count  Frequency (%)
1      222    3.7%
18     1103   18.3%
25     2096   34.7%
35     1193   19.8%
45     550    9.1%
50     496    8.2%
56     380    6.3%

Value  Count  Frequency (%)
56     380    6.3%
50     496    8.2%
45     550    9.1%
35     1193   19.8%
25     2096   34.7%
18     1103   18.3%
1      222    3.7%
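
Since Age has only seven distinct values, its summary statistics can be recomputed from the value counts alone. The sketch below is a cross-check and not the profiling tool's own code. It uses the sample variance (n - 1 denominator), which is an assumption that happens to reproduce the reported figure.

```python
# Value counts for Age, copied from the frequency tables in this report.
age_counts = {1: 222, 18: 1103, 25: 2096, 35: 1193, 45: 550, 50: 496, 56: 380}

n = sum(age_counts.values())
total = sum(age * count for age, count in age_counts.items())
mean = total / n

# Sample variance computed from the grouped counts.
sum_sq_dev = sum(count * (age - mean) ** 2 for age, count in age_counts.items())
variance = sum_sq_dev / (n - 1)

# Percentage share of each value, as shown in the Frequency (%) column.
share = {age: 100 * count / n for age, count in age_counts.items()}

print(n, total, round(mean, 8), round(variance, 7))
```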

Occupation
Real number (ℝ≥0)

HIGH CORRELATION
ZEROS

Distinct: 21
Distinct (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 8.146854305
Minimum: 0
Maximum: 20
Zeros: 711
Zeros (%): 11.8%
Negative: 0
Negative (%): 0.0%
Memory size: 47.3 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 3
median: 7
Q3: 14
95-th percentile: 19
Maximum: 20
Range: 20
Interquartile range (IQR): 11

Descriptive statistics

Standard deviation: 6.329511491
Coefficient of variation (CV): 0.7769270512
Kurtosis: -1.21414437
Mean: 8.146854305
Median Absolute Deviation (MAD): 5
Skewness: 0.3382981095
Sum: 49207
Variance: 40.06271572
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=21)
Value              Count  Frequency (%)
4                  759    12.6%
0                  711    11.8%
7                  679    11.2%
1                  528    8.7%
17                 502    8.3%
12                 388    6.4%
14                 302    5.0%
20                 281    4.7%
2                  267    4.4%
16                 241    4.0%
Other values (11)  1382   22.9%

Value  Count  Frequency (%)
0      711    11.8%
1      528    8.7%
2      267    4.4%
3      173    2.9%
4      759    12.6%
5      112    1.9%
6      236    3.9%
7      679    11.2%
8      17     0.3%
9      92     1.5%

Value  Count  Frequency (%)
20     281    4.7%
19     72     1.2%
18     70     1.2%
17     502    8.3%
16     241    4.0%
15     144    2.4%
14     302    5.0%
13     142    2.4%
12     388    6.4%
11     129    2.1%

Zip-code
Categorical

HIGH CARDINALITY

Distinct: 3439
Distinct (%): 56.9%
Missing: 0
Missing (%): 0.0%
Memory size: 47.3 KiB
48104: 19
22903: 18
55104: 17
94110: 17
55455: 16
Other values (3434): 5953

Length

Max length: 10
Median length: 5
Mean length: 5.058112583
Min length: 5

Characters and Unicode

Total characters: 30551
Distinct characters: 11
Distinct categories: 2
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
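
As an illustration of the character properties mentioned above, Python's standard `unicodedata` module exposes the general category of each code point. In the sketch below, the first four codes come from this report's sample rows, while `12345-6789` is a made-up ZIP+4 style value added only to show the dash category. The counts are for this toy list, not for the dataset.

```python
import unicodedata
from collections import Counter

# Four codes from the report's sample rows plus one hypothetical ZIP+4 value.
zip_codes = ["48067", "70072", "55117", "02460", "12345-6789"]

# Count characters by Unicode general category:
# "Nd" is Decimal Number, "Pd" is Dash Punctuation.
category_counts = Counter(
    unicodedata.category(ch) for code in zip_codes for ch in code
)

print(dict(category_counts))
```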

Unique

Unique: 2293
Unique (%): 38.0%

Sample

1st row: 48067
2nd row: 70072
3rd row: 55117
4th row: 02460
5th row: 55455

Common Values

Value                Count  Frequency (%)
48104                19     0.3%
22903                18     0.3%
55104                17     0.3%
94110                17     0.3%
55455                16     0.3%
55105                16     0.3%
10025                16     0.3%
94114                15     0.2%
55408                15     0.2%
02138                15     0.2%
Other values (3429)  5876   97.3%

Length

Histogram of lengths of the category

Most occurring characters

Value  Count  Frequency (%)
0      5293   17.3%
1      4140   13.6%
2      3272   10.7%
4      3119   10.2%
5      3091   10.1%
3      2663   8.7%
9      2571   8.4%
6      2229   7.3%
7      2109   6.9%
8      1998   6.5%

Most occurring categories

Value             Count  Frequency (%)
Decimal Number    30485  99.8%
Dash Punctuation  66     0.2%

Most frequent character per category

Decimal Number
Value  Count  Frequency (%)
0      5293   17.4%
1      4140   13.6%
2      3272   10.7%
4      3119   10.2%
5      3091   10.1%
3      2663   8.7%
9      2571   8.4%
6      2229   7.3%
7      2109   6.9%
8      1998   6.6%

Dash Punctuation
Value  Count  Frequency (%)
-      66     100.0%

Most occurring scripts

Value   Count  Frequency (%)
Common  30551  100.0%

Most frequent character per script

Common
Value  Count  Frequency (%)
0      5293   17.3%
1      4140   13.6%
2      3272   10.7%
4      3119   10.2%
5      3091   10.1%
3      2663   8.7%
9      2571   8.4%
6      2229   7.3%
7      2109   6.9%
8      1998   6.5%

Most occurring blocks

Value  Count  Frequency (%)
ASCII  30551  100.0%

Most frequent character per block

ASCII
Value  Count  Frequency (%)
0      5293   17.3%
1      4140   13.6%
2      3272   10.7%
4      3119   10.2%
5      3091   10.1%
3      2663   8.7%
9      2571   8.4%
6      2229   7.3%
7      2109   6.9%
8      1998   6.5%

Interactions

[Figures: 9 interaction plots]

Correlations


Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Missing values

A simple visualization of nullity by column.
Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.
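
The quantity behind both nullity plots is a per-column count of present versus missing cells. The sketch below computes it for three rows taken from this report's sample, represented as plain dictionaries. This representation is an assumption for illustration, and `None` stands in for a missing cell.

```python
# Three rows from the report's "First rows" sample; None would mark a missing cell.
rows = [
    {"UserID": 1, "Gender": "F", "Age": 1, "Occupation": 10, "Zip-code": "48067"},
    {"UserID": 2, "Gender": "M", "Age": 56, "Occupation": 16, "Zip-code": "70072"},
    {"UserID": 3, "Gender": "M", "Age": 25, "Occupation": 15, "Zip-code": "55117"},
]

columns = list(rows[0])

# Non-null count per column (the bar heights in a nullity bar chart).
non_null = {col: sum(row[col] is not None for row in rows) for col in columns}
missing = {col: len(rows) - non_null[col] for col in columns}

print(non_null)
print(missing)
```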

Sample

First rows

      UserID  Gender  Age  Occupation  Zip-code
0     1       F       1    10          48067
1     2       M       56   16          70072
2     3       M       25   15          55117
3     4       M       45   7           02460
4     5       M       25   20          55455
5     6       F       50   9           55117
6     7       M       35   1           06810
7     8       M       25   12          11413
8     9       M       25   17          61614
9     10      F       35   1           95370

Last rows

      UserID  Gender  Age  Occupation  Zip-code
6030  6031    F       18   0           45123
6031  6032    M       45   7           55108
6032  6033    M       50   13          78232
6033  6034    M       25   14          94117
6034  6035    F       25   1           78734
6035  6036    F       25   15          32603
6036  6037    F       45   1           76006
6037  6038    F       56   1           14706
6038  6039    F       45   0           01060
6039  6040    M       25   6           11106